Cost of Living Throughout America
Final Project
Data Science 1 with R (STAT 301-1)
Introduction
In a comprehensive exploration of the EPI Family Budget dataset, this report delves into the intricate dynamics of cost of living variations across geographical regions, shedding light on the nuanced relationships between family budgets, income disparities, and metro classifications in the United States.
I was originally motivated to perform this analysis because I think it is both interesting and beneficial to understand how the cost of living differs not only from state to state, but also by family size and county location. Additionally, I saw this as an opportunity to learn how income levels and expenses differ across counties and states, and to begin understanding why these differences are present. By also looking at the minimum wage in each state, I think this analysis will bring attention to much of the inequity present in the United States of America.
In terms of initial curiosities while conducting early research on this data, I was interested in how the median family income of each county correlates with total annual expenses at the state, regional, and metro levels. I also wanted to see whether any patterns or trends in budget allocation stand out at the county, metro, state, and regional levels. Moving from annual to monthly numbers, I was interested in whether anything changes between the two calculations and, if so, how that changes the overall cost of living in the areas of interest stated above. With these starting curiosities, I was able to conduct a full exploration of the data focused on comparisons and correlations, bringing insight into how different categories of expenses are valued and allocated in relation to total expenses and the cost of living in different geographical locations.
As stated above, I will be using the Economic Policy Institute’s family budget data, which describes the cost of living in each county in America. This dataset provides the average costs of different aspects of life for each county, both annually and monthly, and breaks these averages down further by family type, ranging from 1 parent and no children families to 2 parent 4 children families. To enhance the dataset for my own research, I added the geographical region in which each county is located based on its state (south, midwest, northeast, and west) and the minimum wage for each pair of state and county. The former was sourced from the U.S. Census Bureau website, and the latter from Paycom.com. See References for additional information and citations for the Economic Policy Institute’s dataset, as well as the extra information I obtained for my region and minimum wage variables.
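As a minimal sketch of how the region and minimum wage variables could be attached, the base R snippet below joins a state-level lookup table onto county-level rows. The data frames, the column names (`state`, `county`, `region`, `min_wage`), and the wage values are hypothetical stand-ins for illustration, not the actual EPI, Census, or Paycom data.

```r
# Toy stand-in for the family budget data; column names are hypothetical.
budget <- data.frame(
  state  = c("IL", "IL", "TX", "CA"),
  county = c("Cook County", "Lake County", "Travis County", "Alameda County"),
  stringsAsFactors = FALSE
)

# State-level lookup table. Regions follow the four Census regions;
# the wage values are illustrative placeholders only.
state_info <- data.frame(
  state    = c("IL", "TX", "CA"),
  region   = c("midwest", "south", "west"),
  min_wage = c(13.00, 7.25, 15.50),
  stringsAsFactors = FALSE
)

# Left join: every county row inherits its state's region and minimum wage.
budget_aug <- merge(budget, state_info, by = "state", all.x = TRUE, sort = FALSE)
print(budget_aug)
```

Using `all.x = TRUE` keeps every county row even when a state is missing from the lookup table, so any gaps appear as `NA` instead of silently dropping rows.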
In terms of the layout of my report, I will first provide an overview and quality check of my data, describing how the data looks and how I formatted it to best suit my research. I will then start my main explorations, where I will discuss the early univariate and bivariate analyses that I conducted, and then turn to the three main questions that established the flow of my exploratory data analysis. Lastly, I will conclude with a summary of the main insights that I found throughout my research and discuss potential directions in which these analyses can be taken further.
Data Overview & Quality
The EPI Family Budget dataset in its original state consisted of 27 variables and 31,430 observations. Within this, there were twenty-three numerical variables and four categorical variables. I added my own minimum wage and regional variables and changed the variable type of the family type and metropolitan status variables. I also had to decide how to handle the missingness in my dataset. After further investigation, I realized that all of the missing values corresponded to one specific county and its multiple family types. I decided it would be best to remove the observations for that county entirely, as I felt that leaving them in would cause more problems for my analysis than taking them out. My updated version of the dataset therefore includes 29 variables and 31,420 observations. Within this, there are six categorical variables and twenty-three numerical variables. After these changes, the dataset has no remaining missing values and should be straightforward to use in my analyses.
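The cleaning steps described above can be sketched in base R as follows. This is only an illustration on a toy data frame: the column names (`county`, `family`, `metro`, `total_annual`) and all values are hypothetical, and the real dataset has many more columns.

```r
# Toy data: one county ("County B") carries all of the missing values.
budget <- data.frame(
  county       = rep(c("County A", "County B", "County C"), each = 2),
  family       = rep(c("1p0c", "2p4c"), times = 3),
  metro        = c("metro", "metro", "nonmetro", "nonmetro", "nonmetro", "nonmetro"),
  total_annual = c(40000, 110000, NA, NA, 35000, 98000),
  stringsAsFactors = FALSE
)

# Count missing values per county to see where the missingness is concentrated.
na_by_county <- tapply(rowSums(is.na(budget)), budget$county, sum)
print(na_by_county)

# Drop every row belonging to a county that has any missing value.
bad_counties <- names(na_by_county)[na_by_county > 0]
budget_clean <- budget[!budget$county %in% bad_counties, ]

# Treat family type and metro status as categorical variables.
budget_clean$family <- factor(budget_clean$family)
budget_clean$metro  <- factor(budget_clean$metro)
str(budget_clean)
```

Checking the per-county counts before deleting anything confirms that the missingness is confined to one county, which is what justifies removing the whole county instead of imputing values.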
Explorations
Welcome to the heart of my analysis: a comprehensive exploration of the EPI Family Budget dataset. In this section, we embark on a journey through the layers of the data, unraveling the complexities of cost-of-living variations across multiple geographical facets in the United States.
Before diving into our main questions, let’s briefly revisit the insights gleaned from my preliminary analyses of individual and paired variables, which provided me with a foundational understanding of the dataset’s landscape.
Additionally, I would like to preface this section by noting that I have presented only the most pivotal figures, those that capture the core analyses driving my insights and narrative. While these highlighted figures capture the essence of my findings, I acknowledge that a comprehensive view may be desired. For the complete array of visuals generated during my explorations, including supplementary analyses and detailed breakdowns, please refer to Appendix I, which collects all figures not shown in this section.
Univariate Analysis
In terms of my univariate analysis, I looked at both the categorical and numerical variables, and found the most interesting statistics and figures in the numerical categories of expenses.
However, before looking into the categories of expenses, I believe it is important to highlight the difference between the number of nonmetro and metro areas in the dataset, to gauge whether this geographical difference will affect how I view and analyze my later findings.
In Figure 1 we see that there are many more counties in nonmetropolitan areas than in metropolitan areas. I am interested to see how this affects aspects such as transportation and healthcare. Being farther from a metro area can mean more travel to obtain necessities, and it seems that people who live farther from hospitals, or who do not have the abundance of hospitals found in extremely urban areas, might go to the hospital less often. I was therefore excited to look further into these relationships. Additionally, we can compare metropolitan areas in the south to those in the north, and do the same for nonmetropolitan areas in each region, to gauge whether geographical region matters more than metro status or vice versa.
Looking at the categories of expenses, I originally wanted to focus my research mostly on the annual and monthly costs for total expenses, transportation, healthcare, and housing. Below I have provided a brief explanation of the distribution of each expense at the national level. A breakdown of the other categories of expenses can be found in Appendix I - Univariate Analysis.
In Figure 2 we see that the distribution of annual healthcare expenses has an extremely large spread in comparison to the other variables at the annual level. Within that plot, there seems to be a symmetric multimodal shape, with average annual healthcare costs of around $12000. However, even outside of this average value, there are smaller but still notable subgroups with average healthcare costs of around $6000 and $20000. Some early potential explanations for this distribution are family size and location, as well as how the minimum wage rate and median family income relate to these higher healthcare expenses. Turning to the distribution of annual housing costs, we see a right-skewed distribution, as most families tend to spend around $12000 on housing annually. I am surprised that there isn’t a larger spread, as I know that housing in cities tends to be more expensive than housing in nonmetropolitan areas, and that different regions have different housing market demands. The distribution of transportation expenses has a unimodal right-skewed shape, with most families spending about $13000 a year on transportation. I am not surprised by the lack of spread in this distribution, as most people, regardless of location, spend a lot of money on car expenses each year; however, I want to see if metropolitan status makes any difference to the shape of the distribution. Lastly, in terms of the annual variables, the distribution of total annual costs at the national level has a bimodal and slightly right-skewed shape, with most families spending around $60000 a year. Within this plot of total annual expenses, we expect to see, and do see, that although we have our average value, there is a lot of spread and variation away from this average that we must account for, relating to state and regional differences.
Turning our attention to the distributions of the same variables at the monthly level, we see, as expected, similar trends to those pointed out above. For example, the distribution of monthly healthcare costs shows a lot of variability, alongside a multimodal, slightly right-skewed shape, with an average cost of around $1200 a month. In terms of monthly housing expenses, the plot shows that families spend about $900 on average, with some special cases of families spending over $2000 a month, producing a unimodal right-skewed shape. For the transportation distribution, we again see a right-skewed unimodal shape, with families spending $1200 each month on average. Lastly, looking at total monthly expenses, we also see a fairly large spread in the amount that family types spend monthly at the national level, with an average of around $7000 and a unimodal, right-skewed distribution. Each of these plots is as expected, both in comparison to the annual expense distributions and in terms of where the spread of each distribution occurs.
Bivariate Analysis
For my bivariate analysis, I stayed at the national level and assessed whether there were any differences between the types of expenses, at both the monthly and annual levels, in relation to metro status, family type, and geographical region. Overall, I found that across all three characteristics the expenses showed little to no difference. I therefore think that at the national level we can conclude that there is little to no difference in expenses in terms of metro status, family type, and geographical region, meaning that if differences are present they would be more visible at the regional, state, and county levels. Below, I have provided figures showing the relationship of total annual and monthly expenses to family type, metro status, and geographical region as visual evidence for this claim.
The figures in this section focus on total annual and monthly expenses across the three characteristics; visualizations of the other main categories in my exploration thus far (transportation, housing, healthcare) can be seen in Appendix I - Bivariate Analysis.
Looking at each set of plots, we see that the only real difference at this stage appears across family types: the distributions of total expenses (both annual and monthly) shift toward higher values as family size increases. This makes sense, as larger families tend to have more expenses. Otherwise, at the geographical and metro status levels, we see little to no difference in total annual expenses. Moving the analysis to a more micro level will therefore help to find the differences that are currently hidden.
To conclude the univariate and bivariate analyses, I want to look at the main factors causing the large spread in the distributions of healthcare costs at both the monthly and annual levels, while also seeing how total monthly and annual expenses break down by region or state.
Additionally, I would like to see if families in states with higher minimum wages are better able to meet their budget requirements, while also assessing whether families in states with robust public transportation systems are able to allocate less of their budget to transportation costs, with more of their expenses going towards housing instead.
Furthermore, if I implement this variable, I am interested in how the racial majority of a county corresponds to the main expenses of its residents and the amounts that they spend. Within this, I think it would be interesting to look at state policies and tax expenses and see if we can draw conclusions based on these factors as well.
From these early analyses, I was able to decide how I wished to go about my further explorations.
Now, with this groundwork established, we can turn our attention to the core questions that drive my exploration:
1. How does the cost of living vary across different geographical facets and metro classifications?
For myself: first look at the largest level (national), then the regional level, then the states within each region, comparing them. Follow-up question: Are there observable trends in transportation costs based on the availability and accessibility of public transportation in metro areas?
Our first inquiry delves into understanding how the geography and metro classifications of different areas in America help to uncover variations in the cost of living. Through this question, I aim to unravel the economic nuances that define household budgets.
To start my investigation for this question
Say an intro sentence here:
2. Are there discernible patterns or trends in budget allocation that stand out, considering family types and monthly expenses?
Building on the foundation of my previous analyses, I now shift my focus more heavily to the allocation of expenses. Exploring patterns and trends in budget allocation, I aim to identify distinctive markers that stand out across the different family types and in the relationship between monthly and annual expenses. In particular, I look at how family size impacts budget allocation and whether there are specific categories to which larger families allocate a significantly higher percentage of their budget.
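One way to make "percentage of budget" concrete is to divide each expense category by the total and then average those shares within each family type. The base R sketch below does this on a toy data frame; the column names and dollar values are hypothetical, and the total is simplified to the sum of only the three listed categories.

```r
# Toy monthly expenses for two family types; values are illustrative only.
budget <- data.frame(
  family         = c("1p0c", "1p0c", "2p4c", "2p4c"),
  housing        = c(900, 1100, 1500, 1700),
  healthcare     = c(400, 500, 1600, 1800),
  transportation = c(1000, 1200, 1400, 1600),
  stringsAsFactors = FALSE
)
categories <- c("housing", "healthcare", "transportation")

# Simplification: the total is the sum of the listed categories only.
budget$total <- rowSums(budget[, categories])

# Share of the total taken by each category, row by row.
shares <- budget[, categories] / budget$total

# Average share per family type.
mean_shares <- aggregate(shares, by = list(family = budget$family), FUN = mean)
print(mean_shares)
```

Because the shares in each row sum to 1, a category whose share rises with family size must be crowding out other categories, which is the kind of pattern a heatmap of `mean_shares` would make visible.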
For me: I will work on the heatmap and see how that looks, again looking at the national, regional, and state levels. I also want to pick two states, one being the best in a category and one being the worst, and then look at how the counties compare within those two states. Follow-up question: What is the impact of the number of working adults in a family on budget allocation? Do dual-income families allocate their budgets differently compared to single-income families?
intro statement:
3. How does the minimum wage in different states correlate with the affordability of living, particularly in terms of housing, healthcare, and other essential expenses for various family types and regions?
Lastly, I turned my focus to a critical economic variable: minimum wage. By looking at how minimum wage correlates with the cost of living, I aimed to shed light on how variations in minimum wage impact crucial aspects of family budgets, from housing and healthcare to other essential expenses.
Thus through answering these three essential questions, I was able to gain interesting insights on the following…..
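A simple way to relate minimum wage to affordability is to compare what a minimum wage job pays in a year with the total annual budget. The base R sketch below assumes a single earner working 40 hours a week for 52 weeks; that assumption, the column names, and all of the values are hypothetical and only meant to illustrate the calculation.

```r
# Toy data: one family type in two states; values are illustrative only.
budget <- data.frame(
  state        = c("TX", "CA"),
  family       = c("1p0c", "1p0c"),
  min_wage     = c(7.25, 15.50),
  total_annual = c(36000, 52000),
  stringsAsFactors = FALSE
)

# Assumption: one earner working full time all year (40 hours x 52 weeks).
hours_per_year <- 40 * 52

# Annual pre-tax earnings at the minimum wage.
budget$min_wage_income <- budget$min_wage * hours_per_year

# Fraction of the annual budget that those earnings would cover.
budget$coverage <- budget$min_wage_income / budget$total_annual
print(budget)
```

A coverage value below 1 means that full-time minimum wage earnings fall short of the estimated budget, so the ratio can be compared directly across states, regions, and family types.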
Conclusions
State conclusions or insights. Were you surprised by things you found or were they as expected? Why? This is a great place for future work, new research questions, and next steps.
References
Economic Policy Institute. (2022, March). Family Budget Map. https://www.epi.org/resources/budget/budget-map/
U.S. Census Bureau. (2021, October 8). Census Regions and Divisions of the United States. https://www2.census.gov/geo/pdfs/maps-data/maps/reference/us_regdiv.pdf
Paycom. (2023, October 2). Your 2023 Guide to Every State’s Minimum Wage. https://www.paycom.com/resources/blog/minimum-wage-rate-by-state/
Appendix I: Extra Explorations
Useful in cases where there are an over-abundance of explorations and they are not useful in the main body of the report or they are uninteresting, but still think readers should have access to them for reference.
Univariate Analysis